Research Question:

What information can you draw from performing basic k-means clustering on the expedition dataset within the Himalayan Climbing Expeditions database? Find how the highpoint elevation relates to member deaths.

Introduction:

Analysis was conducted on the expeditions dataset to identify relationships that are not apparent on a basic inspection. K-means clustering was performed on this dataset to group expeditions into clusters, with each expedition assigned to the cluster whose mean is nearest. With 10,364 rows and 16 columns, this dataset contains detailed records of expeditions on different peaks in the Himalayan Mountains. The expeditions dataset provides insights on injuries, deaths, and seasonal treks from 1905 to 2019.

While k-means clustering was performed on the dataset as a whole, the following numerical and categorical variables are of main interest in this research document: highpoint_metres, members, member_deaths, and season. Performing k-means clustering on this dataset allows the expeditions to be grouped and then compared by highpoint elevation and percentage of member deaths.

Details on the specific columns are shown below:

highpoint_metres: elevation highpoint of the expedition

members: the number of foreigners listed on the expedition permit

member_deaths: number of expedition members who died

season: season of expedition (spring, summer, etc.)

The dataset analyzed within this research document, expeditions, is shown below:

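library(tidyverse) # readr, dplyr, ggplot2, purrr, tibble
library(broom)     # augment() and tidy() for kmeans fits
expeditions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/34eacc4ccf5878769351a5a21d5992eb28f383b6/data/2020/2020-09-22/expeditions.csv')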
head(expeditions)
## # A tibble: 6 × 16
##   expedi…¹ peak_id peak_…²  year season basecamp…³ highpoin…⁴ terminat…⁵ termi…⁶
##   <chr>    <chr>   <chr>   <dbl> <chr>  <date>     <date>     <date>     <chr>  
## 1 ANN2601… ANN2    Annapu…  1960 Spring 1960-03-15 1960-05-17 NA         Succes…
## 2 ANN2693… ANN2    Annapu…  1969 Autumn 1969-09-25 1969-10-22 1969-10-26 Succes…
## 3 ANN2731… ANN2    Annapu…  1973 Spring 1973-03-16 1973-05-06 NA         Succes…
## 4 ANN2783… ANN2    Annapu…  1978 Autumn 1978-09-08 1978-10-02 1978-10-05 Bad we…
## 5 ANN2793… ANN2    Annapu…  1979 Autumn NA         1979-10-18 1979-10-20 Bad we…
## 6 ANN2801… ANN2    Annapu…  1980 Spring 1980-03-25 1980-04-24 1980-05-01 Accide…
## # … with 7 more variables: highpoint_metres <dbl>, members <dbl>,
## #   member_deaths <dbl>, hired_staff <dbl>, hired_staff_deaths <dbl>,
## #   oxygen_used <lgl>, trekking_agency <chr>, and abbreviated variable names
## #   ¹​expedition_id, ²​peak_name, ³​basecamp_date, ⁴​highpoint_date,
## #   ⁵​termination_date, ⁶​termination_reason

The database that contains the expeditions dataset also includes a members dataset and a peaks dataset. This research document draws on information from the expeditions dataset only.

More information regarding this dataset and parent database can be found at the following link: https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-09-22

Pre-processing of the dataset is provided in the code snippet below:

expeditions <- expeditions %>% 
  filter(season != "Unknown") %>% #remove unknown seasons
  dplyr::select(-c("basecamp_date","highpoint_date","termination_date","termination_reason","trekking_agency")) 
head(expeditions)
## # A tibble: 6 × 11
##   expedit…¹ peak_id peak_…²  year season highp…³ members membe…⁴ hired…⁵ hired…⁶
##   <chr>     <chr>   <chr>   <dbl> <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 ANN260101 ANN2    Annapu…  1960 Spring    7937      10       0       9       0
## 2 ANN269301 ANN2    Annapu…  1969 Autumn    7937      10       0       0       0
## 3 ANN273101 ANN2    Annapu…  1973 Spring    7937       6       0       8       0
## 4 ANN278301 ANN2    Annapu…  1978 Autumn    7000       2       0       0       0
## 5 ANN279301 ANN2    Annapu…  1979 Autumn    7160       3       0       0       0
## 6 ANN280101 ANN2    Annapu…  1980 Spring    7000       6       1       2       0
## # … with 1 more variable: oxygen_used <lgl>, and abbreviated variable names
## #   ¹​expedition_id, ²​peak_name, ³​highpoint_metres, ⁴​member_deaths,
## #   ⁵​hired_staff, ⁶​hired_staff_deaths

Because this analysis relies on the season column, removing the expeditions whose season is not known is essential. Only two expeditions in the entire dataset were removed for having an “Unknown” season. The date columns (basecamp_date, highpoint_date, and termination_date) and two character columns (termination_reason and trekking_agency) were also removed, so that later dropping rows with NA/NaN values preserves as much information as possible.
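The effect of removing sparsely filled columns before dropping incomplete rows can be illustrated with a small sketch. The data frame `toy` and its values are hypothetical and only mirror the column names used here; this is an illustration, not the pre-processing actually run on the expeditions data.

```r
# Toy data frame (hypothetical values) with one sparsely filled date column
toy <- data.frame(
  highpoint_metres = c(7937, 7000, 7160, 8167),
  members          = c(10, 2, 3, 6),
  basecamp_date    = as.Date(c("1960-03-15", NA, NA, "1980-03-25"))
)

# Dropping incomplete rows first loses every row with a missing date
rows_with_dates <- nrow(na.omit(toy))

# Dropping the sparse column first keeps all rows
toy_trimmed <- toy[, setdiff(names(toy), "basecamp_date")]
rows_without_dates <- nrow(na.omit(toy_trimmed))

c(with_dates = rows_with_dates, without_dates = rows_without_dates)
```

In this toy case, half of the rows survive na.omit() when the date column is kept, and all of them survive once it is removed.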

Approach:

K-means clustering is an unsupervised algorithm: it does not use labelled data or a training dataset. It groups observations so that data points within a cluster are as similar as possible and data points in different clusters are as dissimilar as possible.

Below are the basic steps of k-means clustering:

  1. Start with k randomly chosen means

  2. Assign (color) each data point to its nearest mean

  3. Move each mean to the centroid of the points assigned to it

  4. Repeat from step 2 until convergence
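The steps above can be sketched in base R. This is a minimal illustration under stated assumptions, not the implementation behind stats::kmeans(), which the analysis uses: `simple_kmeans` is a hypothetical helper, the starting means are drawn from the data points, squared Euclidean distance is used, and the input is simulated.

```r
# Minimal k-means sketch; simple_kmeans is a hypothetical helper
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: start with k randomly chosen data points as means
  means <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    # Step 2: assign each point to its nearest mean (squared Euclidean distance)
    dists <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, means[j, ])^2))
    new_cluster <- max.col(-dists, ties.method = "first")
    # Step 4: stop once the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Step 3: move each mean to the centroid of its assigned points
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        means[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = means)
}

set.seed(1)
# Two well-separated simulated groups of 20 points each
pts <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2)
)
fit <- simple_kmeans(pts, k = 2)
table(fit$cluster)
```

Because the result depends on the random starting means, kmeans() offers the nstart argument to repeat the procedure from several starts and keep the best fit.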

For this analysis, the percentage of member deaths was calculated as member_deaths / members * 100. Using a percentage rather than a raw count relates highpoint_metres to member_deaths while accounting for how many members were on each expedition. Two k-means cluster analyses were performed to show the difference between including and excluding expeditions with 0% member deaths.

Because k-means relies on means and Euclidean distances, which are not defined for discrete categorical values with no natural origin, the analysis below is done on numerical data only.

The computation of the k-means clustering is shown below:
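As a small worked example of the percentage formula, the sketch below applies it to three hypothetical expeditions. The values and the column name `pct_member_deaths` are illustrative assumptions and are not taken from the dataset.

```r
# Hypothetical expeditions: 6 members with 1 death, 10 with none, 4 with 2 deaths
toy <- data.frame(
  members       = c(6, 10, 4),
  member_deaths = c(1, 0, 2)
)

# Percentage of member deaths, as defined in the text
toy$pct_member_deaths <- toy$member_deaths / toy$members * 100
toy
```

Note that in R a row with zero members would yield NaN or Inf rather than a percentage, so such rows would need separate handling if they occur.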

km_fit <- na.omit(expeditions) %>% 
  dplyr::select(where(is.numeric)) %>%  #selecting only numeric data
  kmeans(
    centers = 5,  # number of cluster centers
    nstart = 10   # number of independent restarts of the algorithm
  )

A summary of the k-means fit produced by the above snippet of code is shown below:

summary(km_fit)
##              Length Class  Mode   
## cluster      9948   -none- numeric
## centers        30   -none- numeric
## totss           1   -none- numeric
## withinss        5   -none- numeric
## tot.withinss    1   -none- numeric
## betweenss       1   -none- numeric
## size            5   -none- numeric
## iter            1   -none- numeric
## ifault          1   -none- numeric
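The summary above lists only the length of each component: 9948 cluster labels (one per complete row) and 30 center values (5 clusters times 6 numeric columns). To look at the contents, the components can be read directly from the fit. The sketch below does this on simulated data; `sim` and `fit` are hypothetical stand-ins for the numeric expedition columns and km_fit.

```r
set.seed(42)
# Simulated numeric data standing in for the numeric expedition columns
sim <- data.frame(
  x = c(rnorm(50, mean = 0), rnorm(50, mean = 8)),
  y = c(rnorm(50, mean = 0), rnorm(50, mean = 8))
)
fit <- kmeans(sim, centers = 2, nstart = 10)

fit$size                    # number of rows assigned to each cluster
fit$centers                 # one row per cluster, one column per variable
fit$tot.withinss            # total within-cluster sum of squares
fit$betweenss / fit$totss   # share of total variation between clusters
```

The total sum of squares splits into the within-cluster and between-cluster parts, so a ratio close to 1 indicates compact, well-separated clusters.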

Analysis:

The below code and graph show k-means clustering for the expeditions dataset with 5 clusters:

# plot
km_fit <- na.omit(expeditions) %>% #omitting NA/NaNs
  dplyr::select(where(is.numeric)) %>% #selecting only numeric data
  kmeans(
    centers = 5,  # number of cluster centers
    nstart = 10   # number of independent restarts of the algorithm
  )
km_fit %>%
  # combine with original data
  augment(na.omit(expeditions)) %>%
  ggplot(
  aes(highpoint_metres, member_deaths/members*100), #percentage of member deaths calculated
  ) +
  geom_point(
    aes(color = .cluster, #color by cluster
    shape = season),
    size = 2,    #fixed size, set outside aes()
    alpha = 0.9  #fixed transparency, set outside aes()
  ) +
  geom_point( #points of the clusters themselves
    data = tidy(km_fit),
    aes(fill = cluster),
    shape = 21, color = "black", size = 4
  ) +
  ggtitle("Clusters") +
  labs( #adding labels
    x = "Highpoint Metres",
    y = "Percentage of Member Deaths (%)",
    fill = "Cluster",
    shape = "Season",
    subtitle = "Figure 1",
    caption = "*Expeditions with zero deaths were included within this analysis",
    ) +
  guides(
    color = "none"
    ) +
  scale_fill_manual(
    values = c("1"='#1b9e77',"2"='#d95f02',"3"='#7570b3',"4"='#e7298a',"5"='#66a61e') #custom palette
    ) +
  scale_color_manual(
    values = c("1"='#1b9e77',"2"='#d95f02',"3"='#7570b3',"4"='#e7298a',"5"='#66a61e') #custom palette
    ) +
  theme_bw( #adding a theme for visualization 
  ) + 
  theme( #aesthetics
    legend.position = "top",
    axis.line = element_line(colour = "black"), 
    panel.border = element_blank(),
    panel.background = element_blank(),
    legend.text=element_text(size=7),
    legend.spacing.y = unit(0.0, 'cm'),
    ) + 
  scale_alpha(
    guide = 'none'
    ) +
  scale_size(
    guide = 'none'
    )

There are 565 expeditions in which a member died. The below graph shows k-means clustering done on only the expeditions where a death occurred, again, with 5 clusters:

# plot
expeditions <- expeditions %>%  filter(member_deaths != 0)
km_fit <- na.omit(expeditions) %>% 
  dplyr::select(where(is.numeric)) %>% 
  kmeans(
    centers = 5,  # number of cluster centers
    nstart = 10   # number of independent restarts of the algorithm
  )
km_fit %>%
  # combine with original data
  augment(na.omit(expeditions)) %>%
  ggplot(
  aes(highpoint_metres, member_deaths/members*100), #percentage of member deaths calculated
  ) +
  geom_point(
    aes(color = .cluster, #color by cluster
    shape = season),
    size = 2,    #fixed size, set outside aes()
    alpha = 0.9  #fixed transparency, set outside aes()
  ) +
  geom_point( #points at center of cluster
    data = tidy(km_fit),
    aes(fill = cluster),
    shape = 21, color = "black", size = 4
  ) +
  ggtitle("Clusters") +
  labs( #adding labels
    x = "Highpoint Metres",
    y = "Percentage of Member Deaths (%)",
    fill = "Cluster",
    shape = "Season",
    subtitle = "Figure 2",
    caption = "*Expeditions with zero deaths were excluded from this analysis",
    ) +
  guides(
    color = "none"
    ) +
  scale_fill_manual(
    values = c("1"='#1b9e77',"2"='#d95f02',"3"='#7570b3',"4"='#e7298a',"5"='#66a61e') #custom palette
    ) +
  scale_color_manual(
    values = c("1"='#1b9e77',"2"='#d95f02',"3"='#7570b3',"4"='#e7298a',"5"='#66a61e') #custom palette
    ) +
  theme_bw( #adding a theme for visualization 
  ) + 
  theme( #aesthetics
    legend.position = "top",
    axis.line = element_line(colour = "black"), 
    panel.border = element_blank(),
    panel.background = element_blank(),
    legend.text=element_text(size=7),
    legend.spacing.y = unit(0.0, 'cm'),
    ) + 
  scale_alpha(
    guide = 'none'
    ) +
  scale_size(
    guide = 'none'
    )

The below code and figure show the scree plot of the k-means clusters for the expeditions dataset where a death occurred:

# function to calculate within sum squares
calc_withinss <- function(data, centers) {
  km_fit <- dplyr::select(data, where(is.numeric)) %>%
    kmeans(centers = centers, nstart = 10)
  km_fit$tot.withinss
}
tibble(centers = 1:15) %>%
  mutate(
    within_sum_squares = map_dbl(
      centers, ~calc_withinss(na.omit(expeditions), .x) #filtered expeditions data, not iris
    )
  ) %>%
  ggplot() +
  aes(centers, within_sum_squares) +
  geom_point(color = "#d95f02",size=3) +
  geom_line(color = "#d95f02", size=1.3) +
  ggtitle("Sum of Squares Scree Plot") +
  labs( #adding labels
    x = "Number of Clusters",
    y = "Within Sum of Squares",
    subtitle = "Figure 3",
    caption = "*Expeditions with zero deaths were excluded from this analysis",
    ) +
  scale_color_manual(
    values = c('#1b9e77','#d95f02','#7570b3','#e7298a','#66a61e') #custom palette
    ) +
  theme_bw( #adding a theme for visualization 
  ) + 
  theme( #aesthetics
    legend.position = "top",
    axis.line = element_line(colour = "black"), 
    panel.border = element_blank(),
    panel.background = element_blank(),
    legend.text=element_text(size=7),
    legend.spacing.y = unit(0.0, 'cm'),
    )

Discussion:

K-means clustering is a widely used technique in machine learning and data science because it helps reveal structure in unlabelled data. It seeks to maximize the similarity of data points within each cluster. The clusters generated from this analysis are used to group similar expeditions.

The analysis done on the expeditions dataset shows several relationships. Figure 1 shows the k-means clustering performed on the full expeditions dataset. In this basic analysis, the clusters are divided most notably by the highpoint elevation of the expedition. Because the large number of expeditions with 0% recorded member deaths makes it difficult to draw conclusions from this figure, an additional analysis was performed with those expeditions omitted.
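One possible contributor to clusters that split mainly along elevation is the scale of the variables: k-means uses Euclidean distance, so a column measured in thousands can outweigh one measured in single digits. This is an assumption and is not tested in this document. The sketch below illustrates the effect on simulated data; `sim`, `elev_group`, and `raw_fit` are hypothetical names and the values are not drawn from the expeditions dataset.

```r
set.seed(7)
# Simulated stand-ins: elevation varies by thousands, deaths by single digits
sim <- data.frame(
  highpoint_metres = c(rnorm(50, mean = 6000, sd = 100),
                       rnorm(50, mean = 8000, sd = 100)),
  member_deaths    = rep(c(0, 3), times = 50)
)
elev_group <- rep(c("low", "high"), each = 50)

# k-means on the unscaled columns
raw_fit <- kmeans(sim, centers = 2, nstart = 10)

# Cross-tabulate the clusters against the simulated elevation group
table(elev_group, cluster = raw_fit$cluster)

# Compare the spread of the two columns
sapply(sim, var)
```

Standardizing the columns with scale() before clustering is a common way to weigh variables equally; whether that suits this analysis is a judgment call.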

In Figure 2, the clusters generated by the k-means analysis suggest that, in general, the lower the highpoint elevation of the expedition, the greater the percentage of member deaths. Because the points are shaped by season, this figure also gives better insight into the seasons in which the highest percentages of deaths occur. Other variables, such as highpoint_metres on the x-axis, also factor into which seasons have the highest percentage of deaths. The clusters in this graph differ slightly from those in Figure 1, which also includes the expeditions with 0% member deaths.

Finally, Figure 3, titled Sum of Squares Scree Plot, shows the total within-cluster sum of squares for different numbers of clusters. As the number of clusters increases, the within-cluster sum of squares decreases. The elbow at three clusters represents a balance between minimizing the number of clusters and minimizing the variation within each cluster, achieving parsimony within these parameters.